DES hardware throughput for short operations

ABSTRACT

A symmetric key cryptographic method is provided for short operations. The method includes batching a plurality of operation parameters (1503), and performing an operation according to a corresponding operation parameter (1505). The symmetric key cryptographic method is a Data Encryption Standard (DES) method. The short operations can be less than about 80 bytes. The short operations can be between 8 and 80 bytes. The method includes reading the batched parameters from a dynamic random access memory (1504), and transmitting each operation through a DES engine according to the operations parameter (1505).

This is a non-provisional application claiming the benefit of provisional application Ser. No. 60/201,002, filed May 1, 2000.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to cryptographic support, and more particularly to cryptographic support for short operations.

2. Discussion of Prior Art

Data Encryption Standard (DES) is a widely-used method of data encryption using private keys. There are 72 quadrillion or more possible encryption keys under the DES that can be used for protecting packets between parties over electronic networks. For each packet or message, a key is chosen at random. Like other symmetric key cryptographic methods, both the sender and receiver need to know and use the same private key.

DES applies a 56-bit key to each 64-bit block of data. The process can run in several modes and includes 16 rounds of operations. Although this is considered strong encryption, many companies use triple-DES (TDES), which applies three keys in succession to each packet.
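By way of illustration only, the key-space figure quoted above follows directly from the 56-bit key length, and the 64-bit (8-byte) block size fixes how many blocks a short operation occupies. The following Python sketch works through that arithmetic; the helper name blocks_needed is hypothetical and is not taken from any DES standard.

```python
# Illustrative arithmetic only; the names below are assumptions for this sketch.
DES_KEY_BITS = 56      # effective key length stated in the text
DES_BLOCK_BYTES = 8    # 64-bit block

# Number of distinct 56-bit keys (about 72 quadrillion).
key_space = 2 ** DES_KEY_BITS


def blocks_needed(data_len: int) -> int:
    """Number of 8-byte blocks needed to hold data_len bytes (rounded up)."""
    return -(-data_len // DES_BLOCK_BYTES)


print(f"{key_space:,}")
print(blocks_needed(8), blocks_needed(80))
```

Under these figures, the short operations considered here span between one and ten DES blocks.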

DES originated at IBM in 1977 and was adopted by the U.S. Department of Defense. It is specified in the ANSI X3.92 and X3.106 standards and in the Federal Information Processing Standards (FIPS) 46 and 81 standards.

Typically, cryptographic methods focus on large packets (greater than about 80 bytes). However, when a DES system is used for smaller packets, the performance may drop by an order of magnitude.

Therefore, a need exists for a system and method of cryptographic support for DES operations which has high throughput for both long (>80 bytes) and shorter packets.

SUMMARY OF THE INVENTION

According to an embodiment of the present invention, a symmetric key cryptographic method is provided for short operations. The method includes batching a plurality of operation parameters, and performing an operation according to a corresponding operation parameter. The symmetric key cryptographic method is a Data Encryption Standard (DES) method. The short operations can be less than about 80 bytes. The short operations can be between 8 and 80 bytes.

The method includes batching the plurality of operation parameters and a plurality of DES operations into a single request, calling DES for each operation in the request, and performing DES for each operation separately according to the corresponding operation parameter.

The method further includes batching the plurality of operation parameters and a plurality of DES operations into a single request, calling DES for the batched operations, and performing DES for each operation separately according to the corresponding operation parameter. Each request is performed with a chip reset, a key and an initialization vector. Calling the DES for the batched operations further comprises switching a context for the batched operations. The context switch is between an application layer and a system software layer.

The method includes reading the batched parameters from a dynamic random access memory, and transmitting each operation through a DES engine according to the operations parameter.

According to an embodiment of the present invention, a method is provided for improved DES short operation throughput. The method includes batching a plurality of operation parameters, each operation parameter corresponding to an operation, reading the batched operation parameters into a dynamic random access memory, and transmitting each operation through a DES engine according to the operations parameter. The DES is external-to-external and an output for each operation is transmitted separately. The short operation can be less than about 80 bytes. The short operation can be between 8 and 80 bytes.

According to an embodiment of the present invention, a symmetric key cryptographic method is provided for operations between about 8 and about 80 bytes in length. The method includes providing a key index to an engine, and pumping the operations through the engine in bulk, wherein a central processing unit does not handle the bytes. The engine is a DES engine.

The method includes resetting an engine chip for an operation, reading an initialization vector, and loading the initialization vector into the engine chip. The method further includes determining a key from the key index, loading the key into the engine chip, and reading a data length for the operation.

The method includes transmitting the data length through an Input channel into the engine chip, and transmitting the data length through an Output channel. The channels are FIFOs.

BRIEF DESCRIPTION OF THE DRAWINGS

Preferred embodiments of the present invention will be described below in more detail, with reference to the accompanying drawings:

FIG. 1 is a diagram of the DES architecture according to an embodiment of the present invention;

FIG. 2 is another diagram of the DES architecture according to an embodiment of the present invention;

FIG. 3 is still another diagram of the DES architecture according to an embodiment of the present invention;

FIG. 4 is yet another diagram of the DES architecture according to an embodiment of the present invention;

FIG. 5 is a diagram of the FIFO structure supporting DES/TDES with a coprocessor according to an embodiment of the present invention;

FIG. 6 is another diagram of the FIFO structure supporting DES/TDES with a coprocessor according to an embodiment of the present invention;

FIG. 7 is still another diagram of the FIFO structure supporting DES/TDES with a coprocessor according to an embodiment of the present invention;

FIG. 8 is yet another diagram of the FIFO structure supporting DES/TDES with a coprocessor according to an embodiment of the present invention;

FIG. 9 is a further diagram of the FIFO structure supporting DES/TDES with a coprocessor according to an embodiment of the present invention;

FIG. 10 is a diagram of the FIFO structure supporting DES/TDES with a coprocessor according to an embodiment of the present invention;

FIG. 11 is a flow diagram of an application handling two operations as separate sccRequests according to the prior art;

FIG. 12 is a flow diagram illustrating a batched host-card interaction according to an embodiment of the present invention;

FIG. 13 is a flow diagram of multiple operations batched into a single call according to an embodiment of the present invention;

FIG. 14 is a flow diagram of a method which reduces data transfers for each operation according to an embodiment of the present invention;

FIG. 15 is a flow diagram of a method which batches parameters for all operations into a block according to an embodiment of the present invention; and

FIG. 16 is a graph illustrating DES speeds for various embodiments of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The present invention provides a system and method for cryptographic support which has high throughput for long and short DES operations. According to an embodiment of the present invention, the system includes a multi-chip embedded module, packaged in a Peripheral Component Interconnect (PCI) card. In addition to cryptographic hardware and circuitry for tamper detection and response, a general-purpose computing environment is provided including a central processing unit, and executing software stored in ROM and/or Flash memory.

Referring to FIG. 1, the multiple-layer software architecture of the client 101 and the host 105 is shown. The client-side includes foundational security control in Layers 0 and 1 102, a supervisor-level software system in Layer 2 103, and a user-level software application in Layer 3 104. Layer 2 103 supports application development. Within Layer 2 103, a kernel provides the operating system abstractions of multiple processes and address spaces; these abstractions support independent managers, which handle cryptographic hardware and other input/output (I/O) on the bottom, and provide higher-level application program interfaces (APIs) to the Layer 3 application 104. An API is the specific method prescribed by a computer or by another program by which a programmer writing an application program can make requests of the operating system or another application. Typically, the Layer 3 application 104 in turn provides an abstraction of its own API to a host-side application 107.

The host-side 105 includes a device driver 106 and a host application 107. According to FIG. 2, for the Layer 3 application 104 to use a service provided by the card-side application, the host-side application 107 issues a call to the host-side device driver 106. The device driver 106 opens an sccRequest 108 to the Layer 2 system 103 on the device. Layer 2 103 informs the Layer 3 application 104 resident on the device of the existence of the request, and of the parameters the host sent along with the request.

According to FIGS. 3 and 4, the Layer 3 application 104 handles the host application's request for service; for example, it can direct Layer 2 103 to transfer data 109 to the device driver 106 and perform the needed cryptographic operations. The Layer 3 application 104 closes out the sccRequest 110 and sends the output back 111 to the host application 107.

According to an embodiment of the present invention, a device for fast cryptography is provided. The device includes a coprocessor having a central processing unit (CPU), at least two levels of internal software and at least three data paths. The software levels can include an operating system or kernel level and an application level. The data paths can include an external to internal memory and/or CPU path, an internal memory and/or CPU to a symmetric engine path, and a channel between the external system and the symmetric engine. The channel can be a first-in first-out (FIFO). According to an embodiment of the present invention, the device includes a FIFO state machine. The FIFO state machine structure transports or drives data into and out of the method engine.

It should be noted that while the present invention is presented in terms of a symmetric cryptographic function (e.g., DES), the invention contemplates any parameterized function on variable length data. Thus, DES is provided as an example of an embodiment of the present invention and given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.

Referring to FIG. 5, the FIFO structure works with the DES/TDES engine 500. The present invention is described according to an IBM 4758 coprocessor, specifically Models 002/023 PCI cryptographic coprocessors; however, given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations.

In Model 2 hardware, the FIFO structure also supports fast Secure Hash Algorithm 1 (SHA-1), though the structure may be applied to any method engine.

For both input and output, two pairs of FIFOs 501-504, a PCI FIFO pair 501-502 and an internal FIFO pair 503-504, are provided for external and internal transfer, respectively, as well as a Direct Memory Access (DMA) controller 505-506 for CPU-free transfer into and out of internal dynamic random access memory (DRAM) 507.

The internal CPU 508 selects which data paths to activate, and what key, initialization vector (IV), and other operational parameters the DES engine 500 may use, via control registers (not shown). The IV is generated by a random number generator, typically included in the Layer 2 system, and combined with the unencrypted text and the key. The key is a variable value applied to a block of unencrypted text to produce encrypted text.

Configurations of the DES engine 500 include bulk external-to-external DES (shown in FIG. 8), bulk internal-to-internal DES (output DMA 506 to internal input FIFO 503 to DES 500, then back through the Internal Output FIFO 504 and PCI Output FIFO 502), and DMA transfer (e.g., PCI input FIFO 501 to internal input FIFO 503 to input DMA 505 and from the Output DMA Controller 506 to the Internal Output FIFO 504 and to the PCI Output FIFO 502). Further, the DES hardware can be configured in a bypass mode in which the commercial Layer 2 system does not use the hardware.

One constraint on the system is that either both internal FIFO-DES paths need be selected (bulk mode), or neither is to be selected. Another constraint is that the FIFO configurations cannot be altered until data transfer is paused, and the state machine driving the FIFOs will transfer data asynchronously until resources are exhausted.

The internal CPU 508 can configure the FIFO hardware to support card applications in various ways. For example, FIG. 6 depicts a configuration in which the FIFOs bring data into the card via the DMA, such as when the host application opens up an sccRequest to the card application. Data passes from the PCI Input FIFO 501 to the Internal Input FIFO 503 via 601, to the Input DMA Controller 505 via 602, to the DRAM 507 via 603 and 604.

Referring to FIG. 7, depicting a DES request, the card may transfer the operational parameters from the DRAM 507 into the DES chip 500. The internal CPU 508 loads operational parameters into the DES chip 500 from the DRAM 507 via lines 701-703.

According to FIG. 8, if the DES request is for external-to-external DES, the card will configure the FIFOs to bring the data in from the host, through the DES chip 500 and back to the host. The CPU 508 can configure the FIFOs 501-504 to stream data from the host, through the DES chip and back to the host via lines 801-804.

Additionally, if the DES request is for internal-to-internal DES and is determined to be too short for DMA, the card may manually push the data bytes through. The CPU 508 can drive data from the DRAM 507 through the DES/TDES engine via programmed I/O and lines 901-904.

As depicted in FIG. 10, when the sccRequest is complete, the card may send the results back to the host via DMA. The internal CPU 508 can configure the FIFOs to send data from the DRAM 507 back to the host via the DMA and lines 1001-1004.

The present invention proposes methods for increasing the throughput of short DES operations. The methods used for evaluating the present invention included DES operations, including cipher block chaining (CBC) encrypt and CBC-decrypt, with data sizes distributed uniformly at random between 8 and 80 bytes. Chaining is a method which makes the decryption of a block of cipher text depend on all preceding blocks. The IVs and keys changed with each operation; the keys are triple-DES (TDES) encrypted with a master key stored inside the device. Encrypted keys, IVs and other operational parameters are sent in with each operation, but are not counted as part of the data throughput. Although the keys may change with each operation, the total number of keys is small, relative to the number of requests. Referring to FIG. 16, the speeds obtained for DES operations are shown for various embodiments of the present invention. Using Model 1 hardware, a speed indicated by 1601 was achieved.

A baseline implementation was established using a Model 2 prototype for the following embodiments. According to FIG. 11, the host application handles each operation 1101-1102 as a separate sccRequest 1103-1104 with Programmed Input/Output (PIO) DES. The implementation includes the host application which generates sequences of short-DES requests (cipher key, IV, data) and the card-side application. The card-side application catches each request, unpacks the key, sends the data, key, and IV to the DES engine, and sends the results back to the host. Keys were randomly chosen over a set of cipher keys. Caching keys inside the card reduced the extra TDES key decryption step and increased the speed 1602.

According to an embodiment of the present invention, the short-DES performance can be enhanced by reducing the host-card interaction. Referring to FIG. 12, this includes batching a large sequence of short-DES requests into one sccRequest 1201. The card-side application was modified accordingly to receive the sequence in one step, process each operation 1202-1205, and send the concatenated output back to the host in one step 1206. The Layer 3 application calls DES for each operation 1202 and 1204. Layer 2 performs the DES for each operation separately 1203 and 1205. Speeds obtained for the benchmark data above were between about 18 and 23 kilobytes/second, and up to 40 kilobytes/second with key caching 1603.

According to an embodiment of the present invention, by eliminating the DES chip reset for each operation the short-DES performance may be increased 1604. By generating a sequence of short-DES operation requests that all use one key, one direction (decrypt or encrypt), and IVs of zero (although the IVs may be arbitrary), a speed of about 360 kilobytes/second can be achieved. The card-side application receives the operation sequence and sends the operation sequence to the Layer 2 system. In Layer 2, a modified DES Manager (the component controlling the DES hardware) sets up the chip with the key and an IV of zero, and transmits the data through the chip. At the end of each operation, the DES Manager performs an exclusive-or (XOR) to break the chaining. For example, for encryption, the software manually XORs the last block of cipher text from the previous operation with the first block of plain text for the operation, in order to cancel out the XOR that the chip would do.
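By way of illustration only, the chaining-break XOR described above can be sketched in Python as follows. The sketch assumes a toy, insecure stand-in for the DES block function (toy_encrypt_block) and a hypothetical ToyCbcChip class that keeps its CBC chaining state until it is reset; neither name comes from the actual DES Manager. Operation lengths are assumed to be non-zero multiples of 8 bytes.

```python
from typing import List

BLOCK = 8  # DES block size in bytes


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    """Toy invertible block function standing in for DES (NOT DES, NOT secure)."""
    mixed = xor(block, key)
    rotated = mixed[1:] + mixed[:1]
    return bytes((5 * b + 3) % 256 for b in rotated)


class ToyCbcChip:
    """Hypothetical engine that chains CBC state across calls until reset."""

    def __init__(self, key: bytes, iv: bytes = bytes(BLOCK)) -> None:
        self.key = key
        self.chain = iv

    def encrypt_block(self, block: bytes) -> bytes:
        out = toy_encrypt_block(self.key, xor(block, self.chain))
        self.chain = out  # the chip XORs this into the next input block
        return out


def split_blocks(data: bytes) -> List[bytes]:
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def encrypt_with_reset(key: bytes, operations: List[bytes]) -> List[bytes]:
    """Reference behavior: a fresh chip (zero IV) for every operation."""
    results = []
    for data in operations:
        chip = ToyCbcChip(key)
        results.append(b"".join(chip.encrypt_block(b) for b in split_blocks(data)))
    return results


def encrypt_without_reset(key: bytes, operations: List[bytes]) -> List[bytes]:
    """One chip setup for the whole batch; software XOR breaks the chaining."""
    chip = ToyCbcChip(key)
    prev_last = bytes(BLOCK)  # matches the zero IV before the first operation
    results = []
    for data in operations:
        blocks = split_blocks(data)
        # Pre-XOR the first plain text block with the previous operation's last
        # cipher text block, cancelling the XOR the chip is about to apply.
        blocks[0] = xor(blocks[0], prev_last)
        out = [chip.encrypt_block(b) for b in blocks]
        prev_last = out[-1]
        results.append(b"".join(out))
    return results
```

In this sketch, encrypt_without_reset yields the same cipher text per operation as encrypt_with_reset, while setting up the simulated chip only once for the batch.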

According to the batching method, besides reducing the number of chip resets, the number of context switches between Layer 3 and Layer 2 is reduced from O(n) to O(1), where n is the number of operations in the batch. Referring to FIG. 13, according to another embodiment of the present invention, by using the multi-key, non-zero-IV setup (results shown as 1603), the card-side application 1302 was altered to send batched requests 1301 to a modified DES Manager (Layer 2) 1303-1304, thus reducing the number of context switches. The card-side application 1302 calls DES for the batched operations. The modified DES Manager 1303-1304 processes each request with a chip reset and a new key and IV. The requests are sent to the host 1305. The results obtained using the modified DES Manager 1303-1304 are shown as 1604 in FIG. 16.

According to yet another embodiment of the present invention, the FIFO state machine pumps data bytes through DES in a bulk mode. Thus, the CPU does not handle the data bytes. According to the prior methods, each byte of the cipher key, IV, and data was handled many times. The bytes came in via FIFOs and DMA into the DRAM with an initial sccRequest buffer transfer. The CPU takes the bytes out of DRAM and puts them into the DES chip. The CPU takes the data out of the DES chip and puts it back into DRAM. The CPU sends the data back to the host through the FIFOs. Accordingly, by reducing the number of data transfers the throughput can be increased 1605. Key unpacking is eliminated as a built-in part of the API. Each application may have a unique method of unpacking, making the API unpacking redundant. Within each application an initialization step concludes with a plain text key table resident in the device DRAM. The operation lengths were standardized to 40 bytes. In addition, the host application was modified to generate sequences of requests that include an index into the internal key table, instead of a cipher key. Thus, the card-side application 1401 calls the modified DES Manager 1402 and 1407 and makes the key table 1403 and 1408 available to it, rather than immediately bringing the request sequence from the PCI Input FIFO into DRAM. For each operation the modified DES Manager 1402 and 1407 resets the DES chip; reads the IV and loads it into the chip; reads and sanity checks the key table, looks up the key, and loads it into the chip; and reads the data length for the operation. The modified DES Manager sets up the state machine to transmit that number of bytes through the Input FIFOs into the DES chip then back out the Output FIFOs 1404-1406 and 1409-1411. The card-side application closes out the request 1412. The results are shown as 1605 in FIG. 16.
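By way of illustration only, the per-operation sequence described above (reset, load IV, key lookup by index, read data length, stream the bytes) can be sketched in Python as follows. All names here (OperationHeader, lookup_key, plan_operation) are hypothetical, and the sanity check is assumed to amount to a bounds check on the index and a length check on the key; the actual checks performed by the DES Manager are not specified in the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

DES_KEY_BYTES = 8  # a single DES key occupies 8 bytes, parity bits included


@dataclass(frozen=True)
class OperationHeader:
    """Per-operation parameters sent by the host (hypothetical layout)."""
    key_index: int   # index into the card-resident plain text key table
    iv: bytes        # initialization vector for this operation
    data_len: int    # number of data bytes to stream through the engine


def lookup_key(key_table: Sequence[bytes], key_index: int) -> bytes:
    """Sanity-check the index and return the key it selects."""
    if not 0 <= key_index < len(key_table):
        raise IndexError(f"key index {key_index} is outside the key table")
    key = key_table[key_index]
    if len(key) != DES_KEY_BYTES:
        raise ValueError("key table entry has an unexpected length")
    return key


def plan_operation(
    key_table: Sequence[bytes], header: OperationHeader
) -> List[Tuple[str, Optional[object]]]:
    """Return the ordered engine steps for one operation.

    The CPU touches only the parameters; the data bytes themselves would be
    moved by the FIFO state machine in the final step.
    """
    key = lookup_key(key_table, header.key_index)
    return [
        ("reset_chip", None),
        ("load_iv", header.iv),
        ("load_key", key),
        ("stream_bytes", header.data_len),
    ]
```

Sending an index rather than a cipher key means the card performs a table lookup per operation instead of a TDES key decryption.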

According to an embodiment of the present invention, the number of Industry Standard Architecture (ISA) I/O instructions was increased (doubled), which reduced the throughput by half, showing a correlation between the ISA I/O instructions and the throughput speed. The modified DES Manager described above (with respect to 1605 and FIG. 14) was then modified to use memory-mapped I/O ports instead of ISA I/O when available (the hardware used did not provide memory-mapped I/O ports for all instances). The software was also modified to eliminate any spurious FIFO reads caused by certain state machine polling intermittently. The results are shown as 1606 in FIG. 16.

Referring to FIG. 15, by batching the parameters together, the parameters can be read via memory-mapped operations, allowing modification of the FIFO configuration and the processing of the data. Layer 3 calls DES for the batched operations 1501. The host application batches the per-operation parameters into one group 1503, attached to the input data. The modified DES Manager sets up the Internal FIFOs and the state machine to read the batched parameters, by-passing the DES chip 1502; reads the batched parameters via memory-mapped I/O from the Internal Output FIFO into DRAM 1504 and 1508; reconfigures the FIFOs; and, using the buffered parameters, sets up the state machine and the DES chip to transmit each operation's data 1506 and 1510 from the input FIFOs, through the DES, then back out the Output FIFOs 1505, 1507 and 1509 and 1511. Layer 3 closes out the request 1512. The results are shown as 1607 in FIG. 16. The accuracy of the method may be increased by accessing the IV and data length registers through the ISA method 1608.
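By way of illustration only, grouping the per-operation parameters into one block ahead of the input data can be sketched in Python as follows. The byte layout (a 2-byte operation count, then one 12-byte record per operation holding a key index, an 8-byte IV and a data length, then the concatenated data) is an assumption made for this sketch and is not the format used by the described hardware.

```python
import struct
from typing import List, Tuple

# Hypothetical little-endian layouts, chosen only for this sketch.
COUNT = struct.Struct("<H")      # number of operations in the batch
PARAM = struct.Struct("<H8sH")   # key index, 8-byte IV, data length

Operation = Tuple[int, bytes, bytes]  # (key_index, iv, data)


def pack_batch(ops: List[Operation]) -> bytes:
    """Host side: parameter block first, then all operation data."""
    header = COUNT.pack(len(ops))
    params = b"".join(PARAM.pack(k, iv, len(data)) for k, iv, data in ops)
    payload = b"".join(data for _, _, data in ops)
    return header + params + payload


def unpack_batch(blob: bytes) -> List[Operation]:
    """Card side: read every parameter record first, then slice the data."""
    (count,) = COUNT.unpack_from(blob, 0)
    offset = COUNT.size
    params = []
    for _ in range(count):
        params.append(PARAM.unpack_from(blob, offset))
        offset += PARAM.size
    ops = []
    for key_index, iv, data_len in params:
        # In the described method the data would be streamed through the DES
        # engine by the state machine; here it is only sliced out.
        ops.append((key_index, iv, blob[offset:offset + data_len]))
        offset += data_len
    return ops
```

Because all parameter records sit together at the front, they can be read in a single pass before the FIFOs are reconfigured for the data.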

According to the present invention, the short-DES speed can be determined according to the following relationship:

$\frac{C_{1} \cdot Batches + C_{2} \cdot Batches \cdot Ops + C_{3} \cdot Batches \cdot Ops \cdot DataLen}{Batches \cdot Ops \cdot DataLen}$

where Batches is the number of host-card batches, Ops is the number of operations per batch, DataLen is the average data length per operation, and C₁, C₂, and C₃ are unknown constants representing the per-batch, per-operation and per-byte overheads, respectively.
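By way of illustration only, the relationship above can be evaluated numerically as in the following Python sketch. Dividing each term of the numerator by the denominator gives the equivalent form C₁/(Ops · DataLen) + C₂/DataLen + C₃, which shows that the batch count cancels. The constant values used below are arbitrary placeholders, since the text states that C₁, C₂, and C₃ are unknown.

```python
def per_byte_overhead(c1: float, c2: float, c3: float,
                      batches: int, ops: int, data_len: float) -> float:
    """Evaluate the relationship exactly as written in the text."""
    total = (c1 * batches
             + c2 * batches * ops
             + c3 * batches * ops * data_len)
    return total / (batches * ops * data_len)


def per_byte_overhead_simplified(c1: float, c2: float, c3: float,
                                 ops: int, data_len: float) -> float:
    """Same quantity after cancelling Batches term by term."""
    return c1 / (ops * data_len) + c2 / data_len + c3


# Placeholder constants (assumptions for illustration, not measured values).
C1, C2, C3 = 1000.0, 100.0, 1.0

for ops in (1, 10, 1000):
    print(ops, per_byte_overhead(C1, C2, C3, batches=2, ops=ops, data_len=40))
```

With these placeholder values, increasing the number of operations per batch shrinks the C₁ term, while the C₂/DataLen term stays fixed for a given operation length.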

The present invention contemplates eliminating the per-batch overhead C₁ by modifying the host device driver-Layer 2 interaction to enable indefinite sccRequests, with added polling or signaling to indicate when additional data is ready for transfer. The per-operation overhead C₂ may be reduced by minimizing the number of per-operation parameter transfers. For example, the host application may, within a batch of operations, interleave parameter blocks that assert arguments such as: the next N operations all use a particular key. This method eliminates bringing in and reading the key index for each iteration. Another example includes the host application processing the IVs before or after transmitting the data to the card. This is not a security issue if the host application is trusted to provide the IVs. The method eliminates bringing in the IVs and, because the DES chip has a default IV of zeros after reset, eliminates loading the IVs.

According to another embodiment of the present invention, per-operation overhead may be reduced by redesigning the FIFOs and the state machine. By modifying the DES engine to expect data-input to include parameters interleaved with data, the per-operation overhead C₂ may approach the per-byte overhead C₃. The state machine handles fewer output bytes than input bytes, and the CPU controls the class of engine operations over which the parameters, for example, chosen externally, are allowed to range. For example, the external entity may be allowed to choose only certain types of encryption operations. Further, the CPU may insert indirection between the parameters the external entity chooses and the parameters the engine sees, e.g., the external entity provides an index into an internal table.

Having described embodiments of a system and method of cryptography, it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments of the invention disclosed which are within the scope and spirit of the invention as defined by the appended claims. Having thus described the invention with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

1. A symmetric key cryptographic method for a plurality of operations comprising the steps of: batching a plurality of operation parameters; calling data encryption for the plurality of operations corresponding to a batch of the plurality of operation parameters; reading the batch of the plurality of operations into memory, wherein the batch does not include data; and performing the plurality of operations corresponding to the batch of the plurality of operation parameters individually, wherein performing the plurality of operations comprises reading the batched parameters from the memory, and transmitting each operation through a data encryption standard (DES) engine according to the operations parameter, wherein a first-in-first-out (FIFO) state machine transmits the data for each operation to the DES engine individually and separately from the batch for an external-external DES operation.
2. The method of claim 1, wherein the symmetric key cryptographic method is a Data Encryption Standard (DES) method.
3. The method of claim 1, wherein the plurality of operations are less than about 80 bytes.
4. The method of claim 1, wherein the plurality of operations are between 8 and 80 bytes.
5. A method for improved DES short operation throughput comprising the steps of: batching a plurality of operation parameters, each operation parameter corresponding to an operation; reading the batched operation parameters into a dynamic random access memory; and transmitting each operation through a DES engine according to the operations parameter, wherein transmitting each operation through the DES engine comprises supplying data for each operation individually and separately from the batched operation parameters to the DES engines, wherein the DES is external-to-external and an output for each operation is transmitted separately.
6. The method of claim 5, wherein the short operation is less than about 80 bytes.
7. The method of claim 5, wherein the short operation is between 8 and 80 bytes.